By: Tien Doan, Iris Hu, Shannen Lam
Every December, Spotify releases "Spotify Wrapped," a summary of your most played songs, accompanied by a playlist filled with those songs. This year, Spotify added on "Your Decade Wrapped," a series of playlists summarizing a user's top songs of the 2010s. Earlier in November, Billboard also released their "100 Songs That Defined the 2010s."
This got our group curious about song trends over time. Do the songs that chart have anything in common? Are louder songs more successful? Have artists perfected some kind of formula that guarantees a hit?
So, using Spotify's "Get Audio Features for a Track" API, we analyzed the traits of songs from the Billboard Hot 100 charts over the course of about a year (Jan 2019 to Nov 2019) in order to see if there really is an overarching trend that popular songs follow.
We will need the following:
import pandas as pd
import billboard
import datetime
import seaborn as sns
import time
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import plotly.express as px
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
The first two steps in the data life cycle are data collection and data preprocessing. Since we need both Spotify audio features and Billboard chart data, we focus on finding APIs and libraries to retrieve that information.
Below, we use the convenient billboard.py library to retrieve Billboard Hot 100 chart data for the past year. Unfortunately, retrieving the entire year at once gets us rate limited, so we break the requests into two-month chunks with some sleeping in between to give the API a break.
We store each week's chart into a larger charts array for later processing.
charts = []
year = 2019
one_week = datetime.timedelta(weeks=1)
next_week = datetime.date(year, 1, 1)

# Fetch charts in two-month chunks, sleeping between chunks
# so we don't get rate limited by the Billboard API
prev_month = 1
for end_month in [3, 5, 7, 9, 11]:
    end_date = datetime.date(year, end_month, 1)
    while next_week < end_date:
        date_str = next_week.strftime("%Y-%m-%d")
        chart = billboard.ChartData('hot-100', date=date_str)
        charts.append(chart)
        next_week += one_week
    print('Finished %d-1 through %d-1' % (prev_month, end_month))
    prev_month = end_month
    if end_month != 11:
        time.sleep(100)
This is how many weeks of chart data we have!
print(len(charts))
Now that we have all the Billboard chart data that we need, it's time to grab the audio features for all of the songs in those charts.
Below, we import a nice Python library for the Spotify Web API: Spotipy. In order to retrieve audio features for a track, we first need a track URI, URL, or ID. Unfortunately, the Billboard chart data does not give us that information, so we're going to have to get it ourselves.
The Spotify Web API provides a search endpoint that lets you query their catalog for track URIs. The catch is that the query has to be formatted a specific way: "artist:[artist_name] track:[song_title]", so we have to build those query strings ourselves. Later on, we'll also need the song title, artist, and rank on the Hot 100 chart, so we keep track of those in a tuple as well.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
client_id = %env CLIENT_ID
client_sec = %env CLIENT_SECRET
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_sec)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
hot_100_tracks = []
for chart in charts:
    for song in chart:
        # Query string, plus the title, artist, and rank for later
        hot_100_tracks.append(('artist:' + song.artist.split()[0] + ' track:' + song.title,
                               song.title, song.artist, song.rank))
print(len(hot_100_tracks))
So we have correctly formatted query strings for all the songs! You may have noticed that we only kept the first word of each artist's name. When we first tried searching for tracks, we found that many Billboard entries used artist names that didn't match Spotify's formatting (e.g. featured artists, as in "Chris Brown Featuring Drake"), so a lot of searches failed to find a match. Song titles rarely vary between platforms, so by keeping the full title and chopping the artist's name down to its first word, we matched far more songs to their Spotify track URIs. (As you can see below, we had only 71 failed matches out of 4,400 total searches.)
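As a quick illustration of the matching trick above (using a hypothetical chart entry, not a real query we ran), here's how a Billboard artist string gets chopped down when building the search query:

```python
# Hypothetical Billboard chart entry, for illustration only
billboard_artist = "Chris Brown Featuring Drake"
song_title = "No Guidance"

# Keep only the first word of the artist's name, as described above
query = 'artist:' + billboard_artist.split()[0] + ' track:' + song_title
print(query)  # artist:Chris track:No Guidance
```

Searching for "artist:Chris" is loose, but combined with the exact track title it's specific enough for Spotify's search to return the right match in most cases.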
The Spotify Web API is also rate limited, so after every 100 queries (roughly the number we could make before getting rate limited), we sleep for 10 seconds before doing the next 100.
hot_100_track_uris = []
failed_search = []
number_queries = 0
for track, song, artist, rank in hot_100_tracks:
    track_id = sp.search(q=track, type='track', limit=1)
    if len(track_id['tracks']['items']) > 0:
        hot_100_track_uris.append((track_id['tracks']['items'][0]['uri'], song, artist, rank))
    else:
        failed_search.append(track_id)
    number_queries += 1
    if number_queries % 100 == 0:
        time.sleep(10)
print(failed_search)
print('Completed!')
print(len(failed_search))
print(len(hot_100_track_uris))
Finally, we are ready to query for what we were after from the beginning: audio features. Each query returns an Audio Features object, which we store away for later use. Again, we must sleep in order to not spam the API.
hot_100_audio_features = []
num_queries = 0
for uri, song, artist, rank in hot_100_track_uris:
    af = sp.audio_features(tracks=[uri])[0]
    hot_100_audio_features.append((af, song, artist, rank))
    num_queries += 1
    if num_queries % 100 == 0:
        time.sleep(10)
print('Completed!')
We have: audio features and Billboard Hot 100 chart positions for every song. Now it's finally time to throw it all into a dataframe for analysis and visualization!
The audio features we chose to look closer at are: tempo, valence, loudness, energy, and key. We believe those were the most important features among the full suite of feature information we got: {duration_ms, key, mode, time_signature, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, valence, tempo}.
tracks = []
for af, song, artist, rank in hot_100_audio_features:
    tracks.append((song, artist, af['tempo'], af['valence'], af['loudness'], af['energy'], af['key'], rank))
hot_100_df = pd.DataFrame(tracks, columns=['Song', 'Artist', 'Tempo', 'Valence', 'Loudness', 'Energy', 'Key', 'Rank'])
hot_100_df
# Load the data we saved off to CSV earlier, so we don't have to re-query the APIs
hot_100_df = pd.read_csv('hot_100.csv')
Now that we have all the data, let's clean this up a bit more.
How many of these 4329 songs are actually unique songs? Songs can stay on the Billboard Hot 100 for several weeks!
print(hot_100_df.Song.nunique())
There are only 519 unique songs. Cleaning this data can help speed up our analysis.
For each duplicate song, we'll keep its highest ranking on the Hot 100 and create a "Count" column, keeping track of how many total weeks it has been on the Hot 100 since the beginning of 2019 (consecutive-ness of the weeks doesn't matter to us).
# Sort all the songs by rank, then remove duplicates,
# keeping each song's best (lowest-numbered) ranking
unique_songs_df = hot_100_df.sort_values('Rank', ascending=True).drop_duplicates(['Song', 'Artist'])
# Iterate over all the songs and record how many weeks each one
# appeared on the Hot 100 from Jan 2019 - Nov 2019
for song_name, count in hot_100_df.Song.value_counts().items():
    unique_songs_df.loc[unique_songs_df['Song'] == song_name, 'Count'] = count
unique_songs_df.sort_values('Count', ascending=False)
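The loop above works, but the same dedupe-and-count step can also be done in a single groupby. Here's a sketch on toy data (the column names match ours; the data itself is made up):

```python
import pandas as pd

# Toy chart data: the same song can appear in multiple weeks
toy_df = pd.DataFrame({
    'Song':   ['A', 'A', 'B', 'A', 'B'],
    'Artist': ['X', 'X', 'Y', 'X', 'Y'],
    'Rank':   [5, 2, 9, 7, 3],
})

# Best (lowest) rank per song, plus the number of weeks it charted
toy_unique = (toy_df.groupby(['Song', 'Artist'], as_index=False)
                    .agg(Rank=('Rank', 'min'), Count=('Rank', 'size')))
print(toy_unique)
```

Song A appeared three weeks with a best rank of 2; song B appeared two weeks with a best rank of 3. The groupby avoids the repeated boolean-mask lookups of the loop version.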
Now that all the data has been collected and tidied up a bit, we want to see if certain song qualities affect a song's Hot 100 ranking. We decided to look at the following traits: tempo, valence, loudness, energy, and (musical) key. We chose these because we felt like popular music has gotten noisier and faster over the years (the rise of EDM and rave cultures) and wanted to see if that was actually true.
First, we'll look at tempo and whether the Hot 100 rank of a song affects the tempo.
Tempo is measured in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. In other words, the higher the BPM, the faster the song is.
Let's look at a scatterplot and if we can observe anything.
# Sets size for the rest of our figures (i.e. graphs)
sns.set(rc={'figure.figsize':(25,10)})
tempo_rank_plot = unique_songs_df.plot.scatter(x='Rank', y='Tempo', marker='.', c='black')
tempo_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs Tempo (Jan 2019 -- Nov 2019)\n')
tempo_rank_plot.set_xlabel('Rank')
tempo_rank_plot.set_ylabel('Tempo (BPM)')
plt.show()
This graph... looks like a mess. It doesn't give me hope that the others will look significantly nicer. Let's graph it (and the following ones) as a violin plot and see if that helps.
sns.violinplot(x=unique_songs_df['Rank'], y=unique_songs_df['Tempo'])
Okay... this looks mildly better. At least this way, we can tell that most of the violins are unimodal. Note that violins that collapse to a line indicate that the given rank has only a single tempo value among its songs. This applies to the later graphs as well, with the corresponding trait.
No strong pattern jumps out visually, so let's apply a linear regression to see if there's any trend to be found. We used sklearn's LinearRegression to fit the model.
%matplotlib inline
# Reshape the Rank and Tempo to be used in the linear regression
X = unique_songs_df['Rank'].values.reshape(-1,1)
y = unique_songs_df['Tempo'].values.reshape(-1,1)
# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)
# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Tempo (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Tempo (BPM)')
plt.show()
#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
The fitted slope of 0.11 BPM per rank is tiny relative to the spread of tempos, so there is essentially no linear relationship between a song's tempo and its rank.
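A raw regression slope depends on the units of both variables, so a unit-free check like Pearson's r is a more direct way to quantify "almost no correlation". A quick sketch on made-up data (not our actual tempos):

```python
import numpy as np

# Made-up data: 100 ranks with tempos drawn independently of rank
rng = np.random.default_rng(0)
rank = np.arange(1, 101)
tempo = 120 + rng.normal(0, 25, size=100)

# Pearson correlation coefficient; a value near 0 means
# there is no linear relationship between the two variables
r = np.corrcoef(rank, tempo)[0, 1]
print(round(r, 3))
```

Because the tempos here are generated independently of rank, r lands close to zero, which is the same kind of result our real data gives.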
Maybe there's a most popular tempo? Let's graph the usage of tempos and see! Here, we're using plotly bar graphs to make the graphs interactive so we can actually read the exact data values.
# Group the data by tempo
group_tempo = unique_songs_df.groupby('Tempo')
# Maps each tempo to the number of songs with that tempo
tempo_count_dict = {}
# For every tempo group, round the tempo to the nearest integer,
# then add the group's song count to tempo_count_dict
for name, group in group_tempo:
    tempo = int(round(name))
    tempo_count_dict[tempo] = tempo_count_dict.get(tempo, 0) + len(group.index)
tempo_count_dict
# Create a bar graph of tempos and how often they're used in the songs
df = pd.DataFrame(list(zip(tempo_count_dict.keys(), tempo_count_dict.values())), columns=['Tempo', 'Count'])
fig = px.bar(df, x='Tempo', y='Count', labels={'Tempo': 'Tempo (BPM)'})
fig.show()
While faster songs aren't more successful, certain tempos are clearly more common on the Hot 100. The most used appear to be 95, 100, 120, 130, 140, 150, and 160 BPM. Interestingly, they are all multiples of 5.
It seems like most composers whose songs have made it on the Hot 100 don't like using tempos that are more in-between. Also, visually, it looks like the more popular tempos are around 90-110 BPM and 140-160 BPM. Not a lot of popular songs are faster than 180 BPM or slower than 75 BPM.
Next, let's look at the effects of the Hot 100 rank of a song on valence.
According to Spotify:
A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
Let's see if people like happy or sad music more. First, the scatterplot and the violin plot!
# Sets size for the rest of our figures (i.e. graphs)
sns.set(rc={'figure.figsize':(25,10)})
valence_rank_plot = unique_songs_df.plot.scatter(x='Rank', y='Valence', marker='.', c='black')
valence_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Valence(Jan 2019 -- Nov 2019)\n')
valence_rank_plot.set_xlabel('Rank')
valence_rank_plot.set_ylabel('Valence')
plt.show()
sns.violinplot(x=unique_songs_df['Rank'], y=unique_songs_df['Valence'])
Again, nothing in particular stands out here. Let's try linear regression to see if that shows us anything.
# Reshape the Rank and Valence to be used in the linear regression
X = unique_songs_df['Rank'].values.reshape(-1,1)
y = unique_songs_df['Valence'].values.reshape(-1,1)
# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)
# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Valence (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Valence')
plt.show()
#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
This regression slope of -0.00077 is tiny even relative to valence's 0-1 scale; again, there's essentially no linear relationship between the happiness of a song and its popularity.
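One caveat when comparing this to the tempo result: slopes measured on different scales (BPM vs. a 0-1 valence) aren't directly comparable. If you z-score both variables first, the OLS slope becomes Pearson's r, which is comparable across features. A small sketch on made-up numbers:

```python
import numpy as np

# Made-up ranks and valences, for illustration only
rank = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
valence = np.array([0.9, 0.7, 0.8, 0.4, 0.5])

# z-score both variables (zero mean, unit variance)
zx = (rank - rank.mean()) / rank.std()
zy = (valence - valence.mean()) / valence.std()

# The OLS slope on z-scored data equals the Pearson correlation
slope = (zx * zy).sum() / (zx * zx).sum()
r = np.corrcoef(rank, valence)[0, 1]
print(round(slope, 3), round(r, 3))
```

The two printed values are identical, which is why standardized slopes (or r itself) are the fair way to compare "strength of relationship" across our differently-scaled features.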
Now, we should look at the frequency distribution of the valences.
# Group the data by valence
group_valence = unique_songs_df.groupby('Valence')
# Maps each valence to the number of songs with that valence
valence_count_dict = {}
# For every valence group, round the valence to the nearest tenth,
# then add the group's song count to valence_count_dict
for name, group in group_valence:
    valence = round(name, 1)
    valence_count_dict[valence] = valence_count_dict.get(valence, 0) + len(group.index)
# Create a bar graph of valences and how often they occur in the songs
df = pd.DataFrame(list(zip(valence_count_dict.keys(), valence_count_dict.values())), columns=['Valence', 'Count'])
fig = px.bar(df, x='Valence', y='Count')
fig.show()
This valence distribution doesn't really tell us much, because the overall distribution of valences for songs on Spotify looks like the following:

Our valence bar graph seems to reflect that distribution, having similar peaks at 0.4 and 0.6, so we can't draw any conclusions from this.
Maybe loudness affects the popularity of a song?
Loudness is measured in decibels (dB). Spotify says that "Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude)." The values typically range between -60 and 0 dB.
# Sets size for the rest of our figures (i.e. graphs)
sns.set(rc={'figure.figsize':(25,10)})
loudness_rank_plot = unique_songs_df.plot.scatter(x='Rank', y='Loudness', marker='.', c='black')
loudness_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Loudness (Jan 2019 -- Nov 2019)\n')
loudness_rank_plot.set_xlabel('Rank')
loudness_rank_plot.set_ylabel('Loudness (dB)')
plt.show()
sns.violinplot(x=unique_songs_df['Rank'], y=unique_songs_df['Loudness'])
We can observe that most songs don't drop under -20 dB. However, like before, we're not really seeing anything in particular standing out. There is a lot of skewing happening in the violin plots, though.
This doesn't have me hopeful for the regression line, but let's run a linear regression analysis anyway just to see.
# Reshape the Rank and Loudness to be used in the linear regression
X = unique_songs_df['Rank'].values.reshape(-1,1)
y = unique_songs_df['Loudness'].values.reshape(-1,1)
# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)
# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Loudness (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Loudness')
plt.show()
#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Yep, the linear regression model tells us nothing much. With a regression slope of 0.001, there is almost no correlation here.
How about the frequency bar graph?
# Group the data by loudness
group_loudness = unique_songs_df.groupby('Loudness')
# Maps each loudness to the number of songs with that loudness
loudness_count_dict = {}
# For every loudness group, round the loudness to the nearest tenth,
# then add the group's song count to loudness_count_dict
for name, group in group_loudness:
    loudness = round(name, 1)
    loudness_count_dict[loudness] = loudness_count_dict.get(loudness, 0) + len(group.index)
# Create a bar graph of loudness values and how often they occur in the songs
df = pd.DataFrame(list(zip(loudness_count_dict.keys(), loudness_count_dict.values())), columns=['Loudness', 'Count'])
fig = px.bar(df, x='Loudness', y='Count', labels={'Loudness': 'Loudness (dB)'})
fig.show()
This is what the given distribution of loudness for songs provided by Spotify looks like:

In Spotify's distribution, there is a single peak in the -10 to -5 dB range. Our distribution has two peaks: one in the -5.8 to -4.5 dB range and a smaller one in the -7.2 to -6.3 dB range. This roughly mirrors Spotify's distribution, so, as with the valence bar graph, no conclusions can be drawn here.
Can the Hot 100 rank affect energy?
Spotify describes energy as so:
Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
The higher the number, the more "energetic" the song is.
# Sets size for the rest of our figures (i.e. graphs)
sns.set(rc={'figure.figsize':(25,10)})
energy_rank_plot = unique_songs_df.plot.scatter(x='Rank', y='Energy', marker='.', c='black')
energy_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Energy (Jan 2019 -- Nov 2019)\n')
energy_rank_plot.set_xlabel('Rank')
energy_rank_plot.set_ylabel('Energy')
plt.show()
sns.violinplot(x=unique_songs_df['Rank'], y=unique_songs_df['Energy'])
Again, consistent with the traits above: the scatter and violin plots tell us nothing useful. Let's look at the linear regression.
# Reshape the Rank and Energy to be used in the linear regression
X = unique_songs_df['Rank'].values.reshape(-1,1)
y = unique_songs_df['Energy'].values.reshape(-1,1)
# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)
# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Energy (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Energy')
plt.show()
#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
This is our weakest correlation yet. The regression slope is 6.61e-05. It is so incredibly close to 0 that it's safe to say that ranking does not directly correlate with the energy of a song.
Maybe the frequency distribution will show us something.
# Group the data by energy
group_energy = unique_songs_df.groupby('Energy')
# Maps each energy to the number of songs with that energy
energy_count_dict = {}
# For every energy group, round the energy to the nearest tenth,
# then add the group's song count to energy_count_dict
for name, group in group_energy:
    energy = round(name, 1)
    energy_count_dict[energy] = energy_count_dict.get(energy, 0) + len(group.index)
# Create a bar graph of energies and how often they occur in the songs
df = pd.DataFrame(list(zip(energy_count_dict.keys(), energy_count_dict.values())), columns=['Energy', 'Count'])
fig = px.bar(df, x='Energy', y='Count')
fig.show()
This is the Spotify distribution of energies in songs:

Comparing our distribution to Spotify's, there are some noticeable differences. Ours seems to peak in the 0.6 and 0.7 ranges rather than around 0.8, which could indicate that popular songs tend to sit in the middle to middle-high energy range. There are also fewer songs in our distribution between 0.8 and 1.0 than in Spotify's, perhaps indicating that songs that are too energetic are less likely to place on the Billboard Hot 100.
None of the other traits seem to correlate that strongly with song rankings. How about the musical key of a song? Maybe some keys are more popular than others?
Spotify estimates the overall key of a track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
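To make the mapping concrete, here's a tiny helper (our own naming, not part of the Spotify API) that translates the integer key to its name, including the -1 "no key detected" case:

```python
# Standard pitch-class names; the list index matches Spotify's integer key
PITCH_CLASSES = ['C', 'C♯/D♭', 'D', 'D♯/E♭', 'E', 'F',
                 'F♯/G♭', 'G', 'G♯/A♭', 'A', 'A♯/B♭', 'B']

def key_name(key):
    # -1 means Spotify could not detect a key for the track
    return 'unknown' if key == -1 else PITCH_CLASSES[key]

print(key_name(0), key_name(1), key_name(-1))  # C C♯/D♭ unknown
```
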
music_key_rank_plot = unique_songs_df.plot.scatter(x='Rank', y='Key', marker='.', c='black')
music_key_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Musical Key (Jan 2019 -- Nov 2019)\n')
music_key_rank_plot.set_xlabel('Rank')
music_key_rank_plot.set_ylabel('Musical Key')
plt.show()
sns.violinplot(x=unique_songs_df['Rank'], y=unique_songs_df['Key'])
# Reshape the Rank and Key to be used in the linear regression
X = unique_songs_df['Rank'].values.reshape(-1,1)
y = unique_songs_df['Key'].values.reshape(-1,1)
# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)
# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Key (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Key')
plt.show()
#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Okay, so higher keys don't correlate with the popularity of a song. How about if we see how many songs are in which key? Maybe some keys are more popular than others.
# Group the data by musical key
group_musc_key = unique_songs_df.groupby('Key')
# Maps each musical key name to the number of songs in that key
musc_key_count_dict = {}
# Integers map to pitches using standard Pitch Class notation
musc_key_dict = {0:'C', 1:'C♯/D♭', 2:'D', 3:'D♯/E♭', 4:'E', 5:'F', 6:'F♯/G♭', 7:'G', 8:'G♯/A♭', 9:'A', 10:'A♯/B♭', 11:'B'}
# For every musical key group, count the number of songs in
# that key, then add it to musc_key_count_dict
for name, group in group_musc_key:
    musc_key_count_dict[musc_key_dict[name]] = len(group.index)
# Create a bar graph of music keys and how often they're used in the songs.
# Using .items() keeps each key's name paired with its own count, even if
# some key has no songs at all.
df = pd.DataFrame(list(musc_key_count_dict.items()), columns=['Music Key', 'Count'])
fig = px.bar(df, x='Music Key', y='Count')
fig.show()
Wow! So there is definitely a most popular and least popular key among the Hot 100 songs. The most popular key is C♯/D♭ and the least popular one is D♯/E♭.
Note: The reason some keys have two names is enharmonic equivalence: the same pitch can be spelled two different ways (e.g. C♯ and D♭), depending on the musical context.
Since none of the song traits we looked at correlate strongly with ranking, we can't draw many conclusions about the relationship between traits and Hot 100 position, let alone do predictive analysis on it. We only found some commonalities, like the most popular musical key and the most common tempos. So, we'll look into other means of analysis.
We did not find strong correlations between rank and song traits, so let's analyze the 50 songs that spent the most time on the Hot 100 and see if we can find any trends in their song traits.
top_songs_df = unique_songs_df.sort_values('Count', ascending=False).head(50)
top_songs_df
tempo_rank_plot = top_songs_df.plot.scatter(x='Rank', y='Tempo', marker='.', c='black')
tempo_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs Tempo (Top 50)\n')
tempo_rank_plot.set_xlabel('Rank')
tempo_rank_plot.set_ylabel('Tempo (BPM)')
plt.show()
Once again, the scatterplot doesn't really say much.
sns.set(rc={'figure.figsize':(25,10)})
sns.violinplot(x=top_songs_df['Rank'], y=top_songs_df['Tempo'])
# Reshape the Rank and Tempo to be used in the linear regression
X = top_songs_df['Rank'].values.reshape(-1,1)
y = top_songs_df['Tempo'].values.reshape(-1,1)
# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)
# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Tempo (Top 50)\n')
plt.xlabel('Rank')
plt.ylabel('Tempo (BPM)')
plt.show()
#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
With a regression slope of 0.33, the relationship within the top 50 is somewhat stronger than it was over the entire dataset, but it still isn't strong.
# Group the data by tempo
group_tempo = top_songs_df.groupby('Tempo')
# Maps each tempo to the number of songs with that tempo
tempo_count_dict = {}
# For every tempo group, round the tempo to the nearest integer,
# then add the group's song count to tempo_count_dict
for name, group in group_tempo:
    tempo = int(round(name))
    tempo_count_dict[tempo] = tempo_count_dict.get(tempo, 0) + len(group.index)
tempo_count_dict
tempo_count_dict
# Create a bar graph of tempos and how often they're used in the top 50 songs
df = pd.DataFrame(list(zip(tempo_count_dict.keys(), tempo_count_dict.values())), columns=['Tempo (BPM)', 'Count'])
fig = px.bar(df, x='Tempo (BPM)', y='Count')
fig.show()
Similar to the results from the entire dataset, popularity visibly increases in the 96-103 BPM and 124-136 BPM ranges. Most notable are the 4 songs out of the top 50 with a BPM of 136, which account for 8% of the most popular songs. Only 2 songs have a BPM higher than 159; one sits at 168, and the other is much more of an outlier at 202 BPM.
valence_rank_plot = top_songs_df.plot.scatter(x='Rank', y='Valence', marker='.', c='black')
valence_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Valence (Top 50)\n')
valence_rank_plot.set_xlabel('Rank')
valence_rank_plot.set_ylabel('Valence')
plt.show()
sns.violinplot(x=top_songs_df['Rank'], y=top_songs_df['Valence'])
# Reshape the Rank and Valence to be used in the linear regression
X = top_songs_df['Rank'].values.reshape(-1,1)
y = top_songs_df['Valence'].values.reshape(-1,1)
# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)
# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Valence (Top 50)\n')
plt.xlabel('Rank')
plt.ylabel('Valence')
plt.show()
#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
Again, the relationship between song ranking and valence is very weak.
# Group the data by valence
group_valence = top_songs_df.groupby('Valence')
# Maps each valence to the number of songs with that valence
valence_count_dict = {}
# For every valence group, round the valence to the nearest tenth,
# then add the group's song count to valence_count_dict
for name, group in group_valence:
    valence = round(name, 1)
    valence_count_dict[valence] = valence_count_dict.get(valence, 0) + len(group.index)
# Create a bar graph of valences and how often they occur in the top 50 songs
df = pd.DataFrame(list(zip(valence_count_dict.keys(), valence_count_dict.values())), columns=['Valence', 'Count'])
fig = px.bar(df, x='Valence', y='Count')
fig.show()
Here, we can see a slight trend towards mid to lower valences, meaning songs that are neutral or slightly positive stay the longest on the chart.
loudness_rank_plot = top_songs_df.plot.scatter(x='Rank', y='Loudness', marker='.', c='black')
loudness_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Loudness (Top 50)\n')
loudness_rank_plot.set_xlabel('Rank')
loudness_rank_plot.set_ylabel('Loudness (dB)')
plt.show()
sns.violinplot(x=top_songs_df['Rank'], y=top_songs_df['Loudness'])
# Reshape the Rank and Loudness to be used in the linear regression
X = top_songs_df['Rank'].values.reshape(-1,1)
y = top_songs_df['Loudness'].values.reshape(-1,1)
# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)
# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Loudness (Top 50)\n')
plt.xlabel('Rank')
plt.ylabel('Loudness')
plt.show()
#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
The regression model tells us that there is still no correlation.
# Group the data by loudness
group_loudness = top_songs_df.groupby('Loudness')
# Maps each loudness to the number of songs with that loudness
loudness_count_dict = {}
# For every loudness group, round the loudness to the nearest tenth,
# then add the group's song count to loudness_count_dict
for name, group in group_loudness:
    loudness = round(name, 1)
    loudness_count_dict[loudness] = loudness_count_dict.get(loudness, 0) + len(group.index)
# Create a bar graph of loudness values and how often they occur in the top 50 songs
df = pd.DataFrame(list(zip(loudness_count_dict.keys(), loudness_count_dict.values())), columns=['Loudness', 'Count'])
fig = px.bar(df, x='Loudness', y='Count', labels={'Loudness': 'Loudness (dB)'})
fig.show()
Of the top 50 songs, the majority have a loudness above -8 dB. This is still similar to Spotify's distribution.
energy_rank_plot = top_songs_df.plot.scatter(x='Rank', y='Energy', marker='.', c='black')
energy_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Energy (Top 50)\n')
energy_rank_plot.set_xlabel('Rank')
energy_rank_plot.set_ylabel('Energy')
plt.show()
sns.violinplot(x=top_songs_df['Rank'], y=top_songs_df['Energy'])
# Reshape the Rank and Energy to be used in the linear regression
X = top_songs_df['Rank'].values.reshape(-1,1)
y = top_songs_df['Energy'].values.reshape(-1,1)
# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)
# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Energy (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Energy')
plt.show()
#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
# Group the data by energy
group_energy = top_songs_df.groupby('Energy')
# Maps each rounded energy value to the number of songs with that energy
energy_count_dict = {}
# for every energy group, round the energy float to the nearest tenth,
# then add the group's song count to energy_count_dict
for name, group in group_energy:
    if round(name, 1) in energy_count_dict:
        energy_count_dict[round(name, 1)] += len(group.index)
    else:
        energy_count_dict[round(name, 1)] = len(group.index)
# Create a bar graph of energy and how often they're used in the songs
df = pd.DataFrame(list(zip(energy_count_dict.keys(), energy_count_dict.values())), columns =['Energy', 'Count'])
fig = px.bar(df, x='Energy', y='Count', labels={'x':'Energy', 'y':'Count'})
fig.show()
This looks similar to our original energy bar graph, with a peak at 0.7 and large drop at 0.8. This could mean that middle to middle-high energy songs are indeed the most popular.
music_key_rank_plot = top_songs_df.plot.scatter(x='Rank', y='Key', marker='.', c='black')
music_key_rank_plot.set_title('Billboard Hot 100 Rank of a Song vs. Musical Key (Top 50)\n')
music_key_rank_plot.set_xlabel('Rank')
music_key_rank_plot.set_ylabel('Musical Key')
plt.show()
sns.violinplot(x=top_songs_df['Rank'], y=top_songs_df['Key'])
# Reshape the Rank and Key to be used in the linear regression
X = top_songs_df['Rank'].values.reshape(-1,1)
y = top_songs_df['Key'].values.reshape(-1,1)
# Create Linear Regression based on plot
regressor = LinearRegression()
regressor.fit(X, y)
# Plot the data and the regression line
plt.scatter(X, y, color = 'black')
plt.plot(X, regressor.predict(X), color = 'red')
plt.title('Billboard Hot 100 Rank of a Song vs Key (Jan 2019 -- Nov 2019)\n')
plt.xlabel('Rank')
plt.ylabel('Key')
plt.show()
#To retrieve the intercept:
print('Regression model intercept:')
print(regressor.intercept_)
#For retrieving the slope:
print('Regression model slope:')
print(regressor.coef_)
# Group the data by musical key
group_musc_key = top_songs_df.groupby('Key')
# Will hold musical key to the number of songs in that key
musc_key_count_dict = {}
# Integers map to pitches using standard Pitch Class notation
musc_key_dict = {0:'C', 1:'C♯/D♭', 2:'D', 3:'D♯/E♭', 4:'E', 5:'F', 6:'F♯/G♭', 7:'G', 8:'G♯/A♭', 9:'A', 10:'A♯/B♭', 11:'B'}
# for every musical key group, count the number of songs in
# that key, then add it to musc_key_count_dict
for name, group in group_musc_key:
    musc_key_count_dict[musc_key_dict[name]] = len(group.index)
# Create a bar graph of music keys and how often they're used in the songs
df = pd.DataFrame(list(musc_key_count_dict.items()), columns=['Music Key', 'Count'])
fig = px.bar(df, x='Music Key', y='Count', labels={'x':'Music Key', 'y':'Count'})
fig.show()
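The same key histogram can be built without an explicit loop, using `value_counts` and then `rename` to swap the integer index for pitch names. A sketch with hypothetical key integers (in the tutorial this would be `top_songs_df['Key']`):

```python
import pandas as pd

# Pitch-class mapping, as used in the tutorial
musc_key_dict = {0: 'C', 1: 'C♯/D♭', 2: 'D', 3: 'D♯/E♭', 4: 'E', 5: 'F',
                 6: 'F♯/G♭', 7: 'G', 8: 'G♯/A♭', 9: 'A', 10: 'A♯/B♭', 11: 'B'}

# Hypothetical key integers standing in for top_songs_df['Key']
keys = pd.Series([1, 1, 0, 7, 1, 4])

# Count each key, then relabel the index with pitch names in one step
key_counts = keys.value_counts().rename(index=musc_key_dict)
print(key_counts)
```

A nice side effect is that keys with zero songs simply never appear, so labels and counts cannot fall out of alignment.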
Like with the larger dataset, C♯/D♭ is the most popular key, but by a much larger margin this time. The least popular key is still D♯/E♭, and A is also notably uncommon.
Among the top 50 songs, we can see that C♯/D♭ is the most popular key, while D♯/E♭ and A are the least popular. Valences generally fell between 0.3 and 0.6, denoting a slightly positive or neutral sound. Middle to middle-high energy levels, from 0.4 to 0.7, are the most popular, and most of the top 50 had a loudness between -8 and -2.7 dB. Even though most of these values do not deviate significantly from Spotify's aggregate data, they reflect the songs that people listen to the most.
So these individual audio features don't show much correlation in the grand scheme of things. However, 100 songs is a lot of songs; there is definitely room for diversity in the songs that enter the chart. Maybe finding the artists that consistently have top hits and analyzing their music would be more telling. We defined the most popular artists to be the ones with the most songs in the Hot 100.
Below we found all the unique artist names in the Hot 100 charts, but we realized a lot of the artist names included featuring artists in the form "Artist1 Featuring Artist2", so we decided to just include that in the count of songs for "Artist1".
unique_artists = hot_100_df.Artist.unique()
# grab all unique artists, disregard 'Featuring'
for i in range(0, len(unique_artists)):
    if "Featuring" in unique_artists[i]:
        idx = unique_artists[i].find('Featuring')
        unique_artists[i] = unique_artists[i][:idx - 1]
unique_artists = list(set(unique_artists))
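For a vectorized alternative, pandas' `str.split` can strip the "Featuring" credit for a whole column at once: splitting on `" Featuring "` and keeping element 0 leaves non-featuring names untouched. A sketch on a toy artist column (the rows below are hypothetical examples, not our chart data):

```python
import pandas as pd

# Toy artist column illustrating how "Featuring" credits are stripped
artists = pd.Series([
    "Post Malone Featuring 21 Savage",
    "Post Malone",
    "Halsey",
])

# Keep only the text before " Featuring ", then deduplicate
primary = artists.str.split(" Featuring ").str[0]
unique_artists = sorted(primary.unique())
print(unique_artists)  # -> ['Halsey', 'Post Malone']
```

In the tutorial this would run on `hot_100_df['Artist']` in place of the toy Series.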
After getting the unique artists, we need to count the number of songs they have in the Hot 100 charts to determine the most popular artists.
# count how many songs each artist has had on the Hot 100 this year
artist_songs_in_chart = {}
for artist in unique_artists:
    # regex=False so artist names containing regex metacharacters match literally
    song_count = len(unique_songs_df[unique_songs_df['Artist'].str.contains(artist, regex=False)])
    artist_songs_in_chart[artist] = song_count
# sort list of artists in descending order for # of songs in Hot 100 over the year
artist_songs_in_chart_sorted = sorted(artist_songs_in_chart.items(), key=lambda kv: kv[1], reverse=True)
# select the top 5 artists to look at
selected_artists = []
for i in range(0, 5):
    selected_artists.append(artist_songs_in_chart_sorted[i][0])
    print(artist_songs_in_chart_sorted[i][0] + ": " + str(artist_songs_in_chart_sorted[i][1]) + " songs")
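The sort-then-slice above can also be written with `collections.Counter.most_common`, which returns the top N entries in descending order in one step. A sketch with hypothetical per-artist counts standing in for `artist_songs_in_chart`:

```python
from collections import Counter

# Hypothetical per-artist song counts (placeholder values, not our real counts)
artist_songs_in_chart = {"Post Malone": 12, "Ariana Grande": 10, "Drake": 9,
                         "Khalid": 8, "Billie Eilish": 8, "Lizzo": 5}

# most_common(5) sorts descending by count; ties keep insertion order
top_five = Counter(artist_songs_in_chart).most_common(5)
for artist, count in top_five:
    print(f"{artist}: {count} songs")
```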
Sorting by number of unique songs in the charts gives us the above 5 top artists!
So now we can analyze the features of their songs. Below, we plotted each artist's songs' ranks against the five audio features and added a regression line to show the general trend.
audio_features = ['Tempo', 'Valence', 'Loudness', 'Energy', 'Key']
for artist in selected_artists:
    artist_songs_df = unique_songs_df[unique_songs_df['Artist'].str.contains(artist, regex=False)]
    fig = plt.figure()
    fig.subplots_adjust(hspace=.4, wspace=.4)
    fig.suptitle(artist + ": Billboard Hot 100 Rank of a Song vs. Audio Features (Jan 2019 -- Nov 2019)\n")
    # Fit each regression on this artist's songs (not the full chart),
    # so the red line reflects the artist's own trend
    X = artist_songs_df['Rank'].values.reshape(-1, 1)
    for i in range(0, len(audio_features)):
        af = audio_features[i]
        y = artist_songs_df[af].values.reshape(-1, 1)
        regressor = LinearRegression()
        regressor.fit(X, y)
        ax = fig.add_subplot(2, 3, i + 1)
        ax.scatter(artist_songs_df['Rank'], artist_songs_df[af], color='black')
        ax.plot(X, regressor.predict(X), color='red')
        ax.set_xlabel('Rank')
        ax.set_ylabel(af)
    plt.show()
From the graphs, we can see a very slight trend with tempo and valence. More specifically, among these top 5 artists of 2019, slower-tempo songs tended to rank higher, and songs with higher valence also ranked higher. All things considered, the tempos of these artists' songs are not particularly slow; it is just that their relatively slower songs tend to rank higher. The valence result also makes sense, since higher valence means a more positive-sounding song, so people tend to prefer these artists' more upbeat tracks. There does not seem to be much correlation between rank and the other audio features.
Since the "key" feature has fixed values, why don't we take a look at what keys these artists release most of their hits in?
for artist in selected_artists:
    artist_songs_df = unique_songs_df[unique_songs_df['Artist'].str.contains(artist, regex=False)]
    af = 'Key'
    groupby_key = artist_songs_df.groupby(af)
    names = []
    values = []
    for name, group in groupby_key:
        names.append(name)
        values.append(len(group))
    plt.bar(names, values, align='center')
    plt.title(artist + ": Billboard Hot 100 Key Usage in Songs (Jan 2019 -- Nov 2019)\n")
    plt.xlabel(af)
    plt.ylabel('Number of Occurrences')
    plt.show()
Wow! For four of the five artists, keys 0 and 1 (C and C♯/D♭) were used the most. These could be the keys that the public is most receptive to. However, Billie Eilish has clearly had success with most of her Hot 100 hits in key 4 (E).
So how does this relate to all of the other songs in the Hot 100? As mentioned earlier in this tutorial, the most popular key among all the songs is C♯/D♭, which matches what we found here for the most popular artists. D♯/E♭ (key 3) was found to be the least popular key, and that is also reflected here: most artists did not chart a song in key 3, and those who did charted very few.
From our data exploration and analysis, we unfortunately found that there is not much correlation between the audio features of a song and how it ranks on the Billboard Hot 100 charts. The graphs of each audio feature vs. rank show a fairly random, even distribution of values, so we cannot conclude anything from them. However, when we looked at overall counts of certain audio feature values, we found that some values did take the lead.
For musical key, the data showed that C♯/D♭ is by far the most popular key for songs that made it into the Hot 100 charts, while D♯/E♭ turned out to be the least used key among charting songs. This finding admits several explanations.
Without a lot of musical knowledge, it is hard to decide between these explanations. But we did find a pie chart of all the keys used in Spotify songs:

http://thekeyofone.com/blog/the-most-and-least-used-keys-of-music-in-spotify/
This shows us that the most used key across Spotify is in fact G; C♯ is not even close. Granted, this chart is from 2016, so musical trends may have shifted since, but we believe it is a valid argument that C♯/D♭ is not necessarily the easiest key to write songs in.
In order to account for skewing from less popular songs and artists (which are less telling of general taste), we looked at both the songs that stayed on the Hot 100 chart the longest this year and the songs of the most popular artists.
Looking at the top 50 songs that stayed in the chart the longest, we see that C♯/D♭ is also the most used key among them and D♯/E♭ and A are least used. We also see that middle to middle-high energy levels, from 0.4 to 0.7, are the most popular. Most of the top 50 had a loudness from -8 to -2.7 dB. There is a preference for songs that are neutral to slightly positive.
When analyzing the songs of the most popular artists, we found the trends to be the same as the trends in all of the Hot 100. The musical key distribution reflected the same trends of popular C♯/D♭ and least popular D♯/E♭.
For future analysis of songs, we would definitely be interested in exploring these same trends over the course of the whole decade--Jan 2010 to December 2019--to see if anything has actually changed musically over time. Additionally, it would be interesting to see if there are any trends in the musical qualities of songs depending on which season they were released in. For example, are winter song releases less energetic than summer releases?
We would also like to apply some kind of machine learning model to a larger data set to see if we can predict a song's Billboard Hot 100 placement based on its sound qualities or its artist's prior performance on the charts. If a successful model could be built, we would also like to see whether it could accurately predict foreign artists' success on the Hot 100, e.g. "Despacito" by Luis Fonsi and Daddy Yankee or "Boy With Luv" by BTS ft. Halsey.